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We investigate the fine structure of the sp3 hybridized covalent bond geometry that governs the 
tetrahedral architecture around the central C a carbon of a protein backbone, and for this we develop 
new visualization techniques to analyze high resolution X-ray structures in Protein Data Bank. We 
observe that there is a correlation between the deformations of the ideal tetrahedral symmetry 
and the local secondary structure of the protein. We propose a universal coarse grained energy 
function to describe the ensuing side-chain geometry in terms of the Cp carbon orientations. The 
energy function can model the side-chain geometry with a sub-atomic precision. As an example we 
construct the C a -Gp structure of HP35 chicken villin headpiece. We obtain a configuration that 
deviates less than 0.4 A in root-mean-square distance from the experimental X-ray structure. 
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I: INTRODUCTION 

Protein structure validation is based on various well 
tested and broadly accepted stereochemical paradigms. 
Methods such as MolProbity [T] and Procheck [5] and 
many others help crystallographers to find and fix poten- 
tial problems during fitting and refinement. Stereochemi- 
cal assumptions are also instrumental to structure predic- 
tion packages such as Rosetta and I-Tasser [3]. Likewise, 
they form the foundation for parameter determination in 
force fields such as Charmm and Amber [3] that aim to 
describe protein dynamics at atomic scale. 

One of the paradigms is the transfersability assump- 
tion. It states that stereochemical restraints are universal 
and independent of the environment. Among its conse- 
quences are that the covalent bond geometry around the 
backbone C a should seize a very precise tetrahedral sp3 
hybridized shape. For example, the backbone 

r NC = (N-C a -C) 

bond angle should oscillate around a computable average 
value that depends only on the covalent bonds between 
the C Q and the N, C, H and Cp atoms in the trans- 
peptide group. In particular, at least to to the leading 
order its value should not depend on the character of the 
secondary structure environment. Standard molecular 
dynamics force fields explicitly assume this to be the case. 
These force fields are based on a harmonic approximation 
where the bond angles k oscillate with energy [3] 

E b ond= ^2 ^ K (k-K ) 2 (1) 
bonds 

Here uj k and kq are parameters that are in general amino 
acid dependent. But these parameters are presumed to 
be independent of the geometry of the surrounding sec- 
ondary structure. Instead, they are supposed to predict 
the local secondary structure environment. 



The enormous success that has been enjoyed by the 
validation methods and structure prediction programs in 
resolving close to 80.000 crystallographic protein struc- 
tures that are presently in Protein Data Bank (PDB) 
[5] is a clear manifestation that the various paradigms 
are valid to a good precision. However, with the ad- 
vent of third-generation synchrotron sources of X-rays, 
there is now a small but rapidly expanding number of 
protein structures that are resolved with an ultrahigh 
sub- Angstrom resolution. The present, third-generation 
X-ray synchrotron sources such as ESRF in Grenoble and 
PETRA at DESY in Hamburg can already produce pho- 
tons with wavelengths as short as 10 pico-meters. Thus 
it is in principle possible to obtain three dimensional pro- 
tein structures with a comparable resolution. The next- 
generation sources of high brilliance X-ray beams such 
as the European X-Ray Free Electron Laser at DESY, 
will push protein X-ray crystallography to its extreme. 
These future experimental facilities can reach both ultra- 
high spatial and temporal resolutions, with a fully co- 
herent peak brightness that is many orders of magnitude 
higher than what can be obtained with the present third- 
generation synchrotron sources. The on-going experi- 
mental revolution in combination with the ever expand- 
ing need of higher precision for example in the study of 
protein-protein interactions, enzyme catalysis and search 
of causes for protein misfolding related diseases, are good 
incentives for us to scrutinize the level of precision in 
some of the paradigm assumptions on protein backbone 
geometry. And, if need be, to try and develop new the- 
oretical concepts that aim to describe proteins at a pre- 
cision that matches the highest present and near future 
experimental standards, in revealing the finer structures 
of folded proteins. 

In fact, ab initio quantum mechanical calculations [6] 
and empirical studies [7]- [9] of protein backbone geome- 
try have already disclosed that the backbone bond angle 
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tnc = (N-C Q -C) about the C a carbons might oscillate 
quite substantially. The range of variations can be as 
large as 8.8° [8]. This corresponds to a shift of ~ 0.6 
A in the relative positioning of two consecutive C a car- 
bons. A deviation of this size from the ideal value can 
be subjected to experimental scrutiny in X-ray experi- 
ments that reach sub-Angstrom resolution. Indeed, on 
the basis of existing data the authors [7]- [9] have already 
reported that the deviations in the values of the t^c an- 
gle are systematic, and in particular that these deviations 
reflect the local secondary structure. 

The tmc angles are primarily affected by the back- 
bone. As such, their values relate directly to the two 
standard Ramachandran angles, that form the basis for 
structure validation. As a consequence, the literature 
[5]- [9] has until now mainly concentrated on the effects 
that potential deviations of rjvc from ideality have on 
the backbone geometry. Here we extend this analysis to 
the side-chains: The fluctuations in the lengths of the 
covalent bonds in the C Q tetrahedron arc no more than 
around 0.1 A which is much less than the potential ^0.6 
A shift in the relative positioning of two consecutive C Q 
carbons, due to rjvc fluctuations [7] -[3]. This proposes 
that any deviation of tnc from its ideal value inevitably 
propagates to the side-chain dependent r^p = (N-C Q - 
C^) and r C p = (C-C a -Cp) bond angles, and this should 
lead to observable effects in the angular positions of the 
side-chain atoms. 

In this article we first analyze PDB data to find 
whether there are experimental variations in the tetra- 
hedral angles around the Cq,. In particular, we extend 
the analysis of [7j-[9] to study correlations between the 
side-chain dependent angles r^p and Tcp that determine 
the Cp orientations, and the local secondary structure of 
the backbone. Since the side-chain atom positions are not 
easily described in terms of the backbone Ramachandran 
angles, we start by developing new visualization tools. In 
line with [7]- [5] we observe that the local secondary struc- 
ture has a systematic effect on the relative tetrahcdral 
position of the Cp carbon. We then proceed to utilize 
our visualization tools to develop theoretical arguments. 
We propose a coarse-grained framework that computes 
how the observed direction of the evolves along the 
backbone. In particular, we argue that the direction of 
the Cp can be computed from the soliton solution of a 
discrete nonlinear Schrodinger (DNLS) equation. The 
DNLS soliton already shares a remarkable history with 
protein research [10]. Both the DNLS equation and its 
soliton solution were first introduced by Davydov to de- 
scribe the propagation of energy along a- helices [TT] . He 
also proposed that since the propagation leads to a local 
deformation of the protein shape, a trapped soliton is a 
natural cause for the protein to fold. Here we first ar- 
gue on general grounds that the DNLS soliton solution 
can be utilized to determine the secondary structure de- 
pendence in the relative direction of the atoms along 



the backbone. We then consider an explicit example to 
illustrate our general arguments. The example we con- 
sider is the 35-residue subdomain of the villin headpiece 
with PDB code 1YRF. It is a paradigm protein that has 
been studied widely in theoretical approaches to protein 
folding. 



II: VISUALIZATION OF THE C a 
TETRAHEDRON 

We start by visual analysis of crystallographic protein 
data in PDB. The goal is to reveal any secondary struc- 
ture dependence in the values of tjvc and in the adjacent 



and 



T N p EE (N-C a -Cp) 



Tcp = {C — C a — Cp) 



bond angles. In order to minimize any bias, we inspect 
several subsets of PDB. These include the canonical one 
that comprises all PDB configurations with resolution 
2.0 A or better, and its two subsets with resolution bet- 
ter than 1.5 A, and better than 1.0 A. We also inspect a 
subset of the 2.0 A set that contains only those proteins 
that have less than 30% sequence similarity. Our conclu- 
sions are independent of the data set, and for illustrative 
purposes we here use the canonical 2.0 A set. There are 
presently over 30.000 such entries in PDB. 

The Ramachandran angles are defined in terms of the 
backbone amide planes. As such they are not the most 
convenient ones for describing the side-chain geometry. 
Since both r^p and Tcp relate to the side-chain geome- 
try, we prefer to follow [12] and describe the folded pro- 
tein structure in terms of the geometrically determined 
backbone discrete Frenet frames (DFF). These frames 
govern the entire backbone neighborhood, including the 
side-chains. But their construction involves only the C Q 
coordinates where i — 1,...,N label the residues. As 
such, these frames then give a manifestly N, Cp and C 
independent, purely geometric description of the tetra- 
hedral sp3 neighborhood of the Cq, atoms. 

The backbone tangent vectors are 

(2) 



The unit binormal vectors are 



The unit normal vectors 



ii, = b, x t« 



(3) 



(4) 



The orthogonal triplets (n^, b i; tj) determine the discrete 
Frenet frame at each of the positions Yi of the C a car- 
bons. Note that if the tangent vectors and the distances 
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between the C a are known, we can reconstruct the entire 
C Q backbone using 



fc-1 

El 

i=l 



*i+l - r t \ 



(5) 



For the initial condition we can utilize Galilean invariance 
to take ri = 0, and ti to point into the direction of the 
positive z-axis. In particular, ([5| does not involve the 
vectors and b^. 

We introduce the backbone bond angles 



COSK i+ i = COSKi+l t i = U+i ■ U 
and the backbone torsion angles 

bj+i • b 4 



H+l 



H+l,i 



(6) 



(7) 



Note that both angles are manifestly independent of the 
N, and C atoms that are covalently bonded to the 
C a atoms, in this sense they provide an unbiased set of 
coordinates for describing the positions of these atoms. 
If these angles are known, we can use 
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cos k cos t cos k sm t — sin k 

— sin t cos t 
sin k cos t sin k sin r cos n 




(8) 



to iteratively construct the Frenet frame at position i + i 
from the frame at position i. Once we have all the frames, 
we can proceed to construct the entire backbone using 

The bond and torsion angles have a natural interpre- 
tation in terms of the canonical latitude and longitude 
angles of a two-sphere § 2 . In the sequel we find it useful 
to extend the range of Ki to [— 7T, tt] mod(2n). But we in- 
troduce no change in the range of Tj G [— 7r,7r] mod(2Tr). 
We compensate for this two-fold covering of S 2 by the 
following discrete Z2 symmetry 



Kfe — > — Kk for all k > i 
n -t Ti - n 



(9) 



It inverts the directions of the vectors and but has 
no effect on the t, and consequently leaves the backbone 
intact. For details we refer to [12) . 

We use the discrete Frenet frames to display each atom 
in the way, how the atom is seen on the surface of a 
sphere that surrounds an imaginary observer who roller- 
coasts the backbone along the C a atoms, so that the gaze 
direction is always fixed towards the next C Q and with 
local orientation determined by the DFF frames |12) . 

In Figure 1 we show the statistical angular distribution 
of the backbone N and C atoms, and in Figure 2 we show 



the same for the side-chain atoms in our PDB data set 
as seen by the Frenet frame observer who moves through 
all the proteins in our data set. The sphere is centered 
at the C Q , and its radius coincides with the length of the 
(approximatively constant) covalent bond. We take the 
vector t that points towards the next C a to be in the 
direction of the positive z-axis, towards the north-pole of 
the sphere. With n in the direction of positive s-axis we 
have a right-handed Cartesian coordinate system. We 
introduce the canonical spherical coordinates (0, if) to 
describe the distributions. The angle 6 £ [0, tt] measures 
latitude from the positive z-axis, hence it describes the 
distribution of the bond angles Ki. The angle tp € [— 7r, tt] 
measures longitude in a counterclockwise direction from 
the s-axis i.e. from the direction of n towards that of b, 
with ip = at the x-axis. Consequently it describes the 
distribution of the torsion angles r^. 




FIG. 1: (Color online) The directions of a) backbone N- 
atoms and b) backbone C-atoms as seen by a Frenet frame 
observer located at the C Q carbon which is at the center of 
the sphere. In a) the smaller, more point-like direction of 
backbone N atoms corresponds to the L-q Ramachandran re- 
gion. The larger region forms a segment of the great circle 
ip w —15°. Loops interpolate latitudinally between a-helices 
and /3-sheets. In b) the directions of backbone C form a seg- 
ment of a small circle around z-axis, with 9 w 20°. 




FIG. 2: (Color online) The distribution of the Cp directions 
in the Frenet frame. In a) we have all amino acids (including 
proline but excluding glycine that has no Cjj). In b) we show 
only proline. Comparison between a) and b) exemplifies how 
the C/3 direction can depend on the individual amino acid. 
We have chosen proline in b) as it is particularly interesting 
due to the way how it appears in Figure 3b). 
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We find that in the Frenet frame coordinate system, 
the N and C oscillations shown in Figures la) and lb) 
are fully separated into the locally orthogonal 9 and ip 
directions, respectively. This would certainly not be the 
case in a generic coordinate system. Furthermore, sec- 
ondary structures such as a-helices, /3-sheets, loops and 
left-handed a-regions are all clearly identifiable in Fig- 
ures 1. Figure 2a) reveals how the N and C oscillations 
of Figure 1, through the covalent bonds that form the 
sp3 tetrahedron around C a , combine into a horseshoe 
(annulus) shaped nutation of Cp. As visible in the Fig- 
ure, this nutation reflects the local secondary structure 
environment in an equally systematic manner as Figures 
1. 




Tnc Tnp Ten 

FIG. 3: (Color online) The normalized probability density 
angular distribution of the a) tnc, b) r^p and c) rep angles in 
degrees, with a-helices in red (grey), /3-strands in blue (dark 
grey), and 3/10 helices in yellow. The secondary peak in b) is 
due to cis-peptide prolines. Only in Figure a) are the different 
secondary structures visibly separated from each other. In 
Figures b) and c) they are practically fully overlapping, the 
only notable effect is the somewhat higher propensity of 3/10 
helices in connection of cis-peptide prolines. 

The pattern of angular separation in the N and C os- 
cillations in Figure 1 reveals that the underlying ideal 
tctrahedral covalent symmetry around C a is not trans- 
ferable along the protein backbone [7]- [9|. 

We proceed to Figures 3 a)-c) where we plot the tetra- 
hedral bond angles t^c, t c/3 and tn/3 jointly for the a- 
helices, /3-strands and 3/10-helices. As in figure 1 the 
loops interpolate continuously between these regular sec- 
ondary structures. The Figure 3a) clearly reveals that at 
the level of the t^c angle the transferability of the tetra- 
hedral symmetry is absent in a systematic and secondary 
structure dependent manner. But neither Tcp nor r^p 
show any sign whatsoever of secondary structure depen- 
dence. The distributions are practically the same, inde- 
pendently of the secondary structure. (The isolated small 
peak in Figure 3b) is due to proline c?s-peptide groups.) 

As such, the distribution in both Figures 3b) and 3c) 
is what we would expect in the case of ideal, transferable 
sp3 tetrahedral symmetry. In each Figure the average 
value is around 111° and there are secondary structure 
independent fluctuations that are in line with quantum 
mechanical estimates and [B] and empirical studies [7]- 



[3]. But the fact that tjvc in Figure 3a) displays clearly 
visible and systematic secondary structure dependence 
makes it plain and clear that the paradigm of transfer- 
able tetrahedral covalent symmetry around the C a car- 
bon is absent. Furthermore, the way how transferability 
becomes violated reflects the wider secondary structure 
environment of the amino acid along the protein back- 
bone. 

Since the deviation from the ideal tetrahedral symme- 
try is organized in the same way how the proteins are 
folded, these two must share a common origin. But we 
do not have any physical explanation why the lack of ideal 
symmetry is only visible in tjvc- We suspect this has to 
do with existing experimental refinement methods, the 
way how refinement tension is distributed between the 
backbone and side-chains. The high resolution crystallo- 
graphic data which becomes available in future third and 
fourth generation experiments should help to clarify this. 

Ill: SOLITONS AND SIDE-CHAINS 

Any molecular dynamics approach to protein folding 
that we are familiar with, utilizes the harmonic approx- 
imation ([I]) for bond and torsion angles. Here kq is in 
general amino acid dependent but secondary structure 
independent equilibrium value of the bond angle. But 
from Figure 3a) we observe, that for the t?jc angle there 
are three different major equilibrium values. These equi- 
librium bond angle values are amino acid independent, 
but do depend in a nontrivial manner on the local sec- 
ondary structure: The three peaks in Figure 3a) corre- 
spond to the a-helix, /3-strand, and 3/10- helix while for 
a loop the values of the corresponding equilibrium kq in- 
terpolates between these three ground state values. Since 
each of these secondary structures are characterized by a 
different equilibrium value of bond angle k, to the leading 
order we may take Kq to be a function of k. By expanding 
to leading order we get 

where the k^ 1 ' are independent of the local value of n. 
The first two terms simply renormalize the values of ui K 
and kq in |T]). But the third term is conceptually dif- 
ferent. It introduces an anharmonic correction. We con- 
clude that after redefinitions of the parameters, in the 
leading order the potential obtains the functional form 

E bond ~ q.( K 2 -m 2 ) 2 (10) 

We argue that based on the Figure 3a), the anharmonic 
corrections are already visible in the existing high res- 
olution X-ray data. In order to answer the theoretical 
challenge that this poses, we propose to improve existing 
MD force fields, to account for the anharmonic correc- 
tions in the bond angle contribution. 
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A: Backbone energy 

The presence of an anharmonic correction in the bond 
angle energy has important implications to the way how 
proteins fold. For this we start with a simple example. 



We consider the anharmonic potential ( 10 ) in the pres- 



ence of a single coordinate x on a line. The Newton's 
equation is 



rax 



dV 

dx 



We take the potential to have the form 



V(x) = - (x + b) 2 -(x-a) 2 



(11) 



We introduce 



and define 



y = x 



-(b + a) 



(a-b) 



to arrive at the equation of motion 

a 



y 



-y(y 



2 c 2 ) 



which is essentially the continuum nonlinear Schrodinger 
equation (NLSE) [TU], [IT] Note that the potential has 
the symmetric form (10 1. This equation is solved by 



y(t) = c-tanh[c^/^-(t-to)] 



<t) = y(t) + ^(a~b) 



(or x = b) through any kind of continuous finite energy 
transformation. In particular, a soliton configuration 
such as |l2"] ) can not be obtained from any approach that 
only accounts for perturbations that describe small local- 
ized fluctuations around the uniform background ground 
state. 

In [13]- [H], it has been shown that the soliton pro- 
file ( 12 ) can be used to describe loops in folded proteins. 



For this one merges general geometric arguments with 
the concept of universality [H]-[IH], to arrive at the fol- 
lowing simplified, coarse-grained energy function for the 
backbone bond and torsion angles [20], |2T] . 



E : 



AT-1 

£ 

i=l 



~~2 



2 Ki+iKi 



2 2 



N 

£ 

i=l 



2\2 



2 K 2 + q ■ (k 2 - m 2 ) 



a T Ti + 



(13) 



where «j and are the backbone bond and torsion angles 
([6]), Q. Unlike force fields in molecular dynamics, the 



energy function ( 13 ) does not purport to explain the fine 



details of the atomic level mechanisms that give rise to 
protein folding. Instead, in line with general principles of 
effective Landau-Lifschitz theories it describes the prop- 
erties of a folded protein backbone in terms of universal 
physical arguments. Indeed, according to the concept 
of universality [16]- [19] the energy function (13 1 can be 



viewed as the universal long distance limit that emerges 
from any atomic level energy function when the internal 
energy is coarse-grained to include only the backbone 
bond and torsion angles. 

In order to construct the soliton solution, we start by 
introducing the r-equation of motion 



dE 



a T + c T Ti = 



b-e c 



C V2^(*-*0) 



2Sf (*-*<>) 



cosh[c 



lit - t Q )} 



(12) 



This is the hallmark NLSE soliton configuration, so called 
dark soliton solution of the NLSE equation. It interpo- 
lates between the two uniform ground states at x = a 
and x = —b when t — > ±oo. The parameters a, b, to and 
the combination 



are the canonical ones that characterize the asymptotic 
values of x(t) i.e. minima of the potential, and the size 
and location of the soliton. 



For finite t the soliton (12 1 describes a configuration 
with an energy above the uniform ground state x = a 
(or x = b). Nevertheless, it can not decay into x = a 



a T + b T K 2 



(14) 



Notice that even though there are four parameters in ( 14 ) 



one of them, the overall scale, drops out. We then use 



( 14 1 to eliminate the torsion angles, so that the energy 
for the bond angles becomes 

N-l N 

E[k] = - E 2 + £ 2 ^ + V ^ 



where 



V[k] = - 



b T c T — a T d T 
d T 



1 



z T + d T n 2 



2b T 



8qm 2 i2, 4 
K + q ■ K 



(15) 



(16) 



G 



Because the first term contains k in the denominator, its 
variation with k is not that pronounced as the variation 
of the other two terms, which are proportional to the 
second and the fourth power of k, respectively. Moreover, 
because \k\ > 1 radian for proteins, it turns out that the 
first term is small in value compared to the other terms. 
The second and third terms have then the functional form 



of the double well potential (10 1. 



In applications to folded proteins the parameters val- 
ues are such, that in the energy ground state both k and 
t acquire a non-vanishing value. In particular, since the 



and for a-helices and /3-strands the /uj 2 values are deter- 
mined by (17), (18). We remind that negative values of 
related to the positive values by ^ . Note that for 
Hi = /i2 and G\ = (T2 we recover the hyperbolic tangent. 
In this case the two regular secondary structures before 
and after the loop are the same. Moreover, only the (pos- 
itive) a\ and (7 2 are intrinsically loop specific parameters, 
they specify the length of the loop and as in the case of 
the /ii,2, they are combinations of the parameters in ( |13| . 
Similarly, in the case of the torsion angle there is only 



one loop specific parameter in ( 14 ): The overall, common 



functional form of (16) is similar to (10), (11) we can scale of the four parameters is irrelevant in ( 14 ), and two 



expect that there are soliton solutions: 

Geometrically, a uniform constant value of the bond 
and torsion angles describes regular protein secondary 
structures. For example, the standard a-helix is 



helix 



and for the standard /3-strand we have 
(3 — strand : 1 



(18) 



The additional regular secondary structures including 
3/10 helices, left-handed helices etc. are described simi- 
larly. 

But in addition of constant value configurations, as in 



(11) there are also soliton solutions. In particular, since 



protein loops are structures that interpolate between dif- 



ferent constant values such as (17), (18), this means that 
loops correspond to these soliton solutions [T3]-[T3]. In 
order to construct the relevant soliton, we introduce the 
generalized discrete nonlinear Schrodinger (DNLS) equa- 
tion that derives from the energy (13). Variation of this 



energy w.r.t. Ki and substitution of (14) gives 



— 2k,; 



(i=l,...,N) (19) 



where kq — k^v+i = 0. The exact soliton solution to 
the present discrete nonlinear Schrodinger equation is not 
known in a closed form. But numerical approximations 
can be easily constructed using the procedure described 



in [14 . Furthermore, whenever the first term in ( 16 ) is 
small as it is in the case of proteins, an excellent approx- 
imation 1151 is obtained from the naive discretization of 



the continuum soliton (12) 



Mi • e 



ui (i-s) 



• e 



-o- 2 (i-s) 



■ (20) 

Here s is a parameter that determines the backbone site 
location of the center of the fundamental loop that is de- 
scribed by the soliton. The /zi,2 € [0, tt] are parameters, 
their values are entirely determined by the adjacent he- 
lices and strands: Away from the soliton center we have 



Mi 

-M2 



I > s 
i < s 



of the remaining three parameters characterize the reg- 
ular secondary structures that are adjacent to the loop, 



as in (17), (18) 



Entire protein loops can be constructed by combining 
(fl9l, (flil. In [HI it has been shown 



together solitons 



(17) using the Ansatz (20) that over 92% of crystallographic 



PDB configurations can be constructed in terms of 200 
explicit soliton profiles. The solitons of the DNLS equa- 
tion can thus be interpreted as the modular building 
blocks of folded proteins. 



B: Side-chain energy 



We proceed to extend the energy function ( 13 ) so that 



it models the deviations from the paradigm tetrahedral 
symmetry around the C a atoms: The Figures la) and lb) 
reveal that the directions of the backbone N and C atoms 
oscillate in the latitudinal (8) and longitudinal (ip) direc- 
tions respectively, on the surface of the sphere that sur- 
rounds the corresponding C a atom. The covalent bond 
structure that forms the sp3 tetrahedron of the C a atom 
combines these two oscillations into the annulus (horse- 
shoe) shaped Cp nutation of Figure 2a). Consequently 
the natural dynamical variable that describes the nuta- 
tion of on the surface of the sphere is the canonically 
parametrized three component unit vector 



' sin 8 ■ cos <p < 
11 = I sin 8 ■ sin ip 

cos 8 



(21) 



In order to account for the nutation contribution to 



the protein free energy, we then need to augment ( 13 ) by 



terms that engage the additional variables (8i,ipi). 

The latitude angle 8t is counted from the direction of 
the corresponding Frenet frame tangent vector t^. Con- 
sequently it remains invariant under the rotations of the 
local Frenet frames around the direction of [12] . Thus 
it can only couple to other frame rotation invariant quan- 
tities. There are two natural terms, 



and 



t.,- x u,,- = sin( 



COS 8; 
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In the leading order we only account for local interac- 
tions. When we also demand invariance under the Z2 
gauge transformation ^ we conclude that to the leading 
order the corresponding free energy contribution should 
have the form 



E e = |tj x Uj| +ffi(ref)ti • 



(22) 



i=l 



According to Figure 2a) the range of variations in 9i 
are quite small and we estimate that the center of the 
annulus-like region is near 



< V > 



113.4° 



We Taylor expand ( 21 ) around this value so that we have 
to the leading order 



Eg — 



N 

E 

i=l 



— bgnfOi — agt 



The ensuing equation of motion is 

bgnf 



ag 



C0 + dglif 



(23) 



(24) 



As in the case of ( 14 1 we conclude that the overall scale 



of the parameters drops out and this leaves us with three 
independent parameters. In the case of a short loop that 
we can model in terms of a single soliton like (20 1, two 



of the parameters become determined by the value of 
9i in the regular secondary structures that are adjacent 
to the loop. This leaves us with only one loop specific 
parameter. 



The longitude tfi in (21 ) is measured from the direction 

Under the local 



of the Frenet frame normal vector n 
rotations of the Frenet frames 



'Hi 



f cosAi sinAi 
1 — sin Aj cos Aj 



around the tangent vectors t% by an angle Aj we then 
have 

<Pi -> <Pi + Aj 
Thus we may couple tp to the torsion angle as follows, 



<Pi 



k=l 



This combination is invariant under the local rotations of 
the Frenet frame around tj. Since the Tk depend on the 



backbone angles according to ( 14 ) , we can again Taylor 
expand the ensuing energy contribution. From Figure 
2a) we estimate that for the center of the annulus 



< tp > 



139.5 (o) 



Following ( 23 1 we then Taylor expand the tp contribution 



to free energy around this value to conclude that to the 
leading order we have (in Frenet frames) 



N ( A 



2 2 7 2 1 2 I 



(25) 

The equation of motion has the same functional form 



with (14), (24) 



(26) 



Again only three of the four parameters in tp. are inde- 
pendent, the overall scale drops out. 



We confirm that the functional forms (p3| and (25 1 are 



in line with the annulus-like (horseshoe-like) form of the 
Cp nutation in Figure 2a). For this we stereographically 
project the Cp distribution in Figure 2a) onto the com- 
plex plane. Despite the nonlinear nature of the standard 
stereographic projection the annulus-like shape is more 
or less retained. Let r be the approximate radius of a thin 
annulus on the complex plane and let (Oq^q) be the lo- 
cation of its center. In the limit where the corrections 
to the round circular profile of the thin annulus become 
small we can determine the approximate form of the 
nutation region from 

(tanfle^ - tan0 o e i¥, °)(tan0e- iv - tanfloe"^ ) = r 2 

(27) 

Note that this is invariant under local frame rotations. 



We re- write 9 in ( 24 ) as follows 



1 



= o 



dn 2 



We substitute this into (27) and Taylor expand to find 



that to leading order it makes sense to parametrize ip by 



an expression of the functional form ( 26 ) 



We make the following remark: When we combine 



(13), (23) and (25) we arrive at the total energy 

2 + ?-K 2 



AT-l N , 

i=i »=i 



2\2 



N (A 

St 



22 1 2 



-T 2 



N 



/ 1 2 " k 2 9 2 - bgn 2 9i - a s 8. l , , } 



(28) 



(29) 



(30) 



N 



I -f ~ h v K i^ - a v<£i + yV? } (31) 

i=l 
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We have already established that protein backbones can 



be described in terms of soliton solutions to ( 28 1 , ( 29 1 . 
According to ( fl4| ), pj| , ( [26] ) the presence of (|30|) and 
(31 1 does not change the functional form of the effective 



Ki energy (151, (16), all three variables (r i; 6* i; ipA are 
similarly slaved to the bond angles k^. In particular, from 



Figure 2a) we conclude that the contribution of ( 30 1 and 
(31) to the full energy must be minuscule: The range 
of variations in the variables (6t,ipi) is relatively small. 
(This is not the case with Tj, see for example Figure 4 



below.) Thus the values of (30) and (31 1 show very little 



variation, and in comparison to (29) these two terms can 



be treated as if they were tiny perturbations. 



Indeed, in the case of proteins the two terms ( 30 1 and 



(31) make no contribution to the total energy that we 
are able to observe. The variables {9i,(pi) are entirely 
slaved by the DNLS soliton profile of the backbone bond 
angles Ki. Since the direction of the vector (21) that 
specifies the position of the carbon is slaved to k^, 
the deviation from the ideal tetrahedral symmetry in the 
C Q covalent bond geometry is determined by the local 
secondary structure environment of the amino acid. 



C: Comment on parameters 



The energy function ( 28 )-( 31 ) introduces eleven essen- 



tial parameters, when we account for the overall scales 
in|lf, ([24], According to [22 , no more than 200 

different parameter sets are needed to describe over 92% 
of high resolution structures in PDB with a precision 
of around 0.6 A in RMSD for the C Q . The solitons 
are like modular components from which the folded pro- 
teins are built. At the moment we do not have a method 
to compute the parameters directly from the sequence. 
However, even in its present form the approach can be 
subjected to a stringent experimental scrutiny: A typical 
super-secondary structure described by a soliton such as 
a hclix-loop-helix consists of around 15 amino acids. If 
we assume that the bond lengths are fixed, this leaves us 
with 60 unknown coordinates for the C a and Gp atoms. 



Since there are only 11 essential parameters in (28)-(31 1, 
we have a highly under-determined set of equations. Con- 
sequently the model is predictive, a comparison with ex- 
perimental structures is directly testing the physical prin- 



tensive studies both experimentally [33]- [25J and in silico 
[26]- [29], and [29] reports on a molecular dynamics con- 
struction with overall backbone RMSD accuracy around 
one Angstrom. 

In Figure 4 we have the (k{, Tj) spectrum that we com- 
pute from the PDB data of 1YRF. In the Figure 4a) we 
use the standard convention that bond angles take values 
in the range [0,7r]. In the Figure 4b) we have extended 
the range to [— 7r, tt]. This introduced the TL^ gauge trans- 
formation structure j9|. In Figure 4b) we have applied 
the gauge transformation to disclose the solitons. We 



clearly have two solitons with the DNLS profile ( 20 ), sep- 
arated from each other by regions with k ss ±1.57 and 
r 1 corresponding to the a-helix (17). Notice the irreg- 




b) 
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1.5 

1 
0.5 



-0.5 
-1 
-1.5 

-2 
-2.5 



FIG. 4: (Color online) The profile ofni (light blue) andr; 
(dark red) along the 1YRF background. We use PDB indexing 
of the sites. In a) the Ki % ELL" 6 restricted to [0, 7r] and in b) this 
region is extended to [— 7r,7r] using Q. 
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ciples on which (|28|-([31| is based, even though we are not regions. A priori we expect from ( 14 ) that the torsion an- 



yet able to compute the parameters from the sequence. gle should have a regular profile. However, the numerical 



IV: EXAMPLE: VILLIN HEADPIECE HP35 

As an example we consider the chicken villin headpiece 
subdomain HP35. We use the x-ray structure with PDB 
code 1YRF. The HP35 is a naturally existing 35-residue 
protein with three a-helices separated from each other 
by two loops. It continues to be the subject of very ex- 



values that we compute from ( 14 ) are not restricted to 
the fundamental range Tj G [— ir, n], they can take values 
beyond this range. The irregular structure of r% follows 
when we convert the values to the fundamental range, 
using 2tt periodicity of in the discrete Frenet equa- 
tion ([8]. Similarly we observe slight irregularity in the 
Ki profile. This can also be removed if we allow to take 
values beyond [— 7r, tt] and use the 2-7T periodicity. But in 
the case of 1YRF the improvement in the precision turns 
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out to be very small, and consequently we search for a 
solution of (19) by assuming that Ki € [— 7T, 7t]. 



In Figure 5 we show the distribution of the side-chain 
angles (9i,tpi) in YRF, by plotting the tips of the unit 
vector ( plj ) on the two-sphere of Figure 2. As expected, 
they are located in the a-hclix region of Figure 2a) except 
along the loops, where they are located outside of the 
regular structure regions. 




FIG. 5: (Color online) The directional distribution of the 
side-chain angles (#;, ipt). The background coincides with the 
annulus in Figure 2a). 



We start by solving the classical equations of motion 
for ^ from ( |19[ ). We then construct the remaining vari- 
ables (Ti,8, 

Since the ( 



,(fi) in terms of ft, using (14), (24) and (26); 
9i,ifi) contributions to the Ki potential (16) 
are minuscule, we ignore the corresponding parameters in 
constructing the solution to the DNLS equation for Ki. 
We use the iterative algorithm and procedure described 
in [30] . |14) . and our results are summarized in Figure 
6 and Table 1. We have been able to substantially im- 
prove the accuracy reported in |14j . in particular for the 
first soliton. We now reach a RMSD accuracy less than 
0.4 A even when we include the side-chain C« atoms. 
The result is clearly within the Debye- Waller fluctuation 
distance regime that we compute from the experimental 
B-factors in the PDB data. 

In Figure 6a) we display the distance between the com- 
puted and the experimentally measured C a atoms (ex- 
cluding the N and C terminals). The shaded region in 
Figure 6a) describes the 0.15A zero point fluctuations 
[22j around our solitons. For comparison, we also dis- 
play the experimental Debye- Waller B-factor fluctuation 
distances, obtained from the PDB data. Except for the 
end point of soliton 1 (residue 58), our soliton solutions 




FIG. 6: (Color online) Comparison between our soliton solu- 
tions in red (gray) and the experimental B-factor fluctuation 
distance of PDB data for 1YRF (black) along the backbone. 
In Figure a) for the C a , and in Figure b) for t he Cp where the 
experimental accuracy is estimated from ( 32 1. The shaded re- 



gion describes the 0.15 A zero point fluctuations around soli- 
tons. The cut in Figure a) at sites 13-14 is where the two 
solitons overlap (Phe-58 in PDB), and the empty space in 
Figure b) is due to glycine that has no Cp. 



describe the backbone well within the limits of experi- 
mental accuracy. 

In Figure 6b) we present our results for the nu- 
tation, in comparison with the experimental data. We 
also present an estimate for the experimental uncertain- 
ties that we estimate as follows: The experimental B- 
factors give an estimate for the absolute fluctuation dis- 
tance around the measured position. But now we are 
interested in estimating the (much smaller) relative error 
in the position of Gp with respect to the position of the 
ensuing C a . For this we introduce the relative B-factor 



B r 



\B n - Br- 



im 



In Figure 6b) we display the ensuing fluctuation distances 
that we have computed from the Debye- Waller relation 
using p2| ) in lieu of the B-factor. The precision of our 
computed results compare well with these experimental 
relative B-factor errors: For most of the sites the differ- 
ence is no more than the 0.15A estimate for zero point 
fluctuations. 
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parameter 


soliton- 1 


soliton-2 




0.459712 


0.995867 




4.5533320 


9.408796 


mi 


1.504535 


1.550322 


7T12 


1.512836 


1.535081 


a T 


9.5752137e-9 


7.840467e-6 


K 


-676965e-ll 


-4.973244e-9 


Ct 


4.875744e-9 


4.2733696e-6 


d T 


-2.917129e-9 


-2.431388e-6 


Clg 


1.514770 


1.322495 


be 


-0.0017952 


-0.018619 


da 


0420877 


6 930946e-8 


a v 


0.544859 


0.3594184 


bip 


5.66111e-5 


3.83253e-4 


d v 


-0.1845828 


-0.226012 


RMSD (A) 


0.38 


0.32 



TABLE I: Parameter values for the two-soliton solution that 
describes the two loops of 1YRF with a combined 0.39A accu- 
racy for both C Q and atoms. The displayed RMSD values 
are for the individual solitons. The soliton- 1 is located at 
Glu-45 - Phe-58 and the soliton-2 is located at Phe-58 - Lys- 
73. We utilize scale invariance to set all eg = c v = 1. The 
result has sensitivity to the accuracy of parameters, because 
a folded protein is a piecewise linear polygonal chain with a 
positive Liapunov exponent. 



Finally, Figure 7 shows our soliton solution together 
with the 1YRF configuration in PDB. 




FIG. 7: (Color online) A cartoon comparison of HP35 with 
our soliton solution summarized in Table 1. The combined 
G a and root-mean-square distance is 0.39 A which equals 
the experimental Debye- Waller B-factor fluctuation distance 
for the central carbons. 



SUMMARY 

In conclusion, the paradigm assumption that the tetra- 
hedral covalent symmetry around the backbone C a car- 
bons is transferable, is correct to a good precision. How- 
ever, with the advent of third and fourth generation X- 
ray sources there is now a rapid growth in the number of 
protein structures with sub-Angstrom resolution. This 
makes it possible to scrutinize small corrections to this 
paradigm. We have found, that the backbone N — C a — C 
bond angle shows systematic deviations from the ideal 
value, in a manner that is in direct correspondence with 
the corresponding secondary structure environment. We 
have investigated how this effect propagates to the orien- 
tation of the Cp carbon. We have found that the angu- 
lar orientations of the Gp carbon similarly deviate from 
their ideal values, in a manner which is in a one-to-one 
correspondence with the underlying secondary structure 
environment. 

We have presented a simple energy function that is 
based on the concept of universality, to model the sec- 
ondary structure dependence in the Cp orientations. As 
an example, we have constructed the Cc-C^ backbone of 
HP35 villin, where we reach an accuracy that matches the 
experimental B-factor fluctuation distances. We propose 
that our observations and theoretical proposals could 



form a basis for the development of both more accu- 
rate refinement tools for experimental data analysis, and 
of more precise theoretical and computational MD force 
fields, to model the atomic level structure and dynamics 
of folded proteins. 
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